fix: add per-domain RequestThrottler for 429 backoff #1762
MrAliHasan wants to merge 3 commits into apify:master
Conversation
Add a new `RequestThrottler` component that handles HTTP 429 (Too Many Requests) responses on a per-domain basis, preventing the autoscaling death spiral where 429s cause concurrency to increase.

Key features:

- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The `AutoscaledPool` is completely untouched - throttling happens transparently in `BasicCrawler.__run_task_function` before processing.

Integration points:

- `BasicCrawler`: throttle check, 429 recording, success reset
- `AbstractHttpCrawler`: passes URL + Retry-After to detection
- `PlaywrightCrawler`: passes URL + Retry-After to detection

Closes apify#1437
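To make the backoff schedule concrete, here is a minimal sketch of the delay computation described above; the helper name and constants are illustrative, not the PR's actual code:

```python
# Illustrative sketch of the described backoff schedule, not code from this PR.
from datetime import timedelta

BASE_DELAY = timedelta(seconds=2)
MAX_DELAY = timedelta(seconds=60)


def backoff_delay(consecutive_429s: int) -> timedelta:
    """Exponential backoff: 2s -> 4s -> 8s -> ..., capped at 60s."""
    exponent = max(consecutive_429s - 1, 0)
    return min(BASE_DELAY * 2**exponent, MAX_DELAY)


assert backoff_delay(1) == timedelta(seconds=2)
assert backoff_delay(2) == timedelta(seconds=4)
assert backoff_delay(6) == timedelta(seconds=60)  # 64s would exceed the cap
```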
Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.
On top of that, the solution to #1437 should probably be extensible enough to also cover #1396 without much tweaking.
I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via `RequestQueue.open(alias="...")`, so you'd only need to implement a `ThrottlingRequestManager` (an implementation of the `RequestManager` interface) that keeps track of the per-domain queues and their delays.
Do you want to try it?
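For illustration, a minimal sketch of the shape this proposal could take, assuming `RequestQueue.open(alias=...)` as described above; the class and every method name beyond the `RequestManager` surface are hypothetical:

```python
# Illustrative sketch of the proposed design, not the PR's actual code.
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

from crawlee import Request
from crawlee.storages import RequestQueue


class ThrottlingRequestManager:
    """Routes requests into per-domain sub-queues and skips delayed domains."""

    def __init__(self) -> None:
        self._queues: dict[str, RequestQueue] = {}
        self._delayed_until: dict[str, datetime] = {}

    async def _queue_for(self, domain: str) -> RequestQueue:
        # One unnamed queue per domain, via the alias support mentioned above.
        if domain not in self._queues:
            self._queues[domain] = await RequestQueue.open(alias=f'throttled-{domain}')
        return self._queues[domain]

    async def add_request(self, request: Request) -> None:
        domain = urlparse(request.url).hostname or ''
        await (await self._queue_for(domain)).add_request(request)

    def record_delay(self, domain: str, delay: timedelta) -> None:
        # Called on a 429 (or a robots.txt crawl-delay) for the domain.
        self._delayed_until[domain] = datetime.now(timezone.utc) + delay

    async def fetch_next_request(self) -> Request | None:
        # Serve only domains whose delay, if any, has already expired.
        now = datetime.now(timezone.utc)
        for domain, queue in self._queues.items():
            if self._delayed_until.get(domain, now) <= now:
                if (request := await queue.fetch_next_request()) is not None:
                    return request
        return None
```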
```python
@staticmethod
def _parse_retry_after_header(value: str | None) -> timedelta | None:
```
This has no business being in BasicCrawler. Better put it in the _utils module.
Moved to crawlee._utils.http.parse_retry_after_header in the refactor commit.
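The moved helper itself isn't quoted in this thread; a minimal sketch of a parser covering both accepted forms (delta-seconds and HTTP-date) could look like this:

```python
# Illustrative sketch, not the PR's actual implementation.
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after_header(value: str | None) -> timedelta | None:
    """Parse a Retry-After value given either as seconds or as an HTTP-date."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():  # e.g. 'Retry-After: 120'
        return timedelta(seconds=int(value))
    try:  # e.g. 'Retry-After: Wed, 21 Oct 2025 07:28:00 GMT'
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    return max(retry_at - datetime.now(timezone.utc), timedelta(0))
```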
```python
# Check if this domain is currently rate-limited (429 backoff).
if self._request_throttler.is_throttled(request.url):
    self._logger.debug(
        f'Request to {request.url} delayed - domain is rate-limited '
        f'(retry in {self._request_throttler.get_delay(request.url).total_seconds():.1f}s)'
    )
    await request_manager.reclaim_request(request)
    return
```
If, at some point, the request queue contains only requests from a throttled domain, this will become a busy wait with extra steps. If you're using the Apify platform, this will cost a lot in request queue writes.
I'm afraid that this means we cannot accept the PR in the current state. See the main review comments for possible next steps.
Fully addressed in the refactor. The reclaim-based throttle block was removed. ThrottlingRequestManager.fetch_next_request() now handles scheduling and awaits asyncio.sleep() when all domains are throttled, eliminating busy-wait and extra queue writes.
Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
Move per-domain throttling from the execution layer (BasicCrawler.__run_task_function) to the scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue and implements the RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all domains are throttled, eliminating the busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
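The key scheduling trick from this commit, waiting out the shortest remaining delay instead of reclaiming requests in a loop, can be sketched as a standalone helper; the function and dictionary names are illustrative:

```python
# Illustrative sketch of the commit's "sleep instead of reclaim" behavior.
import asyncio
from datetime import datetime, timezone


async def sleep_until_earliest_expiry(delayed_until: dict[str, datetime]) -> None:
    """When every known domain is throttled, sleep until the soonest delay expires.

    This replaces the reclaim loop: no busy-waiting and no extra request
    queue writes while all domains are backed off.
    """
    if not delayed_until:
        return
    earliest = min(delayed_until.values())
    remaining = (earliest - datetime.now(timezone.utc)).total_seconds()
    if remaining > 0:
        await asyncio.sleep(remaining)
```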
```python
if self._throttling_manager:
    self._throttling_manager.record_success(request.url)
```
I'd rather use isinstance so that an explicitly configured ThrottlingRequestManager also works.
Fixed! Updated the check to isinstance(manager, ThrottlingRequestManager).
```python
Args:
    session: The session used for the request. If None, no check is performed.
    status_code: The HTTP status code to check.
    url: The request URL, used for per-domain rate limit tracking.
```
Perhaps the parameter should be renamed to request_url so that it's not ambiguous with a URL after redirects.
Done. Renamed to request_url across BasicCrawler, AbstractHttpCrawler, and PlaywrightCrawler.
| """Raise an exception if the given status code indicates the session is blocked. | ||
|
|
||
| If the status code is 429 (Too Many Requests), the domain is recorded as | ||
| rate-limited in the `RequestThrottler` for per-domain backoff. |
```python
# NOTE: _parse_retry_after_header has been moved to crawlee._utils.http.parse_retry_after_header
```

Suggested change:

```diff
- # NOTE: _parse_retry_after_header has been moved to crawlee._utils.http.parse_retry_after_header
```
```python
if not robots_txt_file:
    return True

# Wire robots.txt crawl-delay into ThrottlingRequestManager (#1396).
```
Suggested change:

```diff
- # Wire robots.txt crawl-delay into ThrottlingRequestManager (#1396).
+ # Wire robots.txt crawl-delay into ThrottlingRequestManager
```
```python
    configuration=self._service_locator.get_configuration(),
)
self._throttling_manager = ThrottlingRequestManager(inner)
self._request_manager = self._throttling_manager
```
I'm not sure if we should use ThrottlingRequestManager by default - thoughts, @vdusek, @Pijukatel, @Mantisus?
I think it makes sense to enable it by default. When the crawler runs without a proxy, retrying on 429 only increases the load on the site, which benefits neither the site nor the crawler.
Using a proxy already requires a little configuration, so I don't think an additional parameter to disable 429 throttling would complicate that.
Please make the tests conform to the existing test structure. Perhaps @vdusek could provide further guidance here?
Good call! I've removed all the Test... classes and rewritten the entirety of test_throttling_request_manager.py to use standard top-level async def test_...() functions with pytest fixtures to match test_request_list.py.
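For reference, the flat test style in question looks roughly like this; the test body is illustrative, `parse_retry_after_header` is the helper this PR adds, and the layout simply mirrors top-level test functions with pytest parametrization rather than `Test...` classes:

```python
# Illustrative example of the flat test layout, not a test from this PR.
from __future__ import annotations

from datetime import timedelta

import pytest

from crawlee._utils.http import parse_retry_after_header


@pytest.mark.parametrize(
    ('value', 'expected'),
    [('120', timedelta(seconds=120)), (None, None)],
)
def test_parse_retry_after_header(value: str | None, expected: timedelta | None) -> None:
    assert parse_retry_after_header(value) == expected
```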
| """A request manager wrapper that enforces per-domain delays. | ||
|
|
||
| Handles both HTTP 429 backoff and robots.txt crawl-delay at the scheduling layer, | ||
| eliminating the busy-wait problem described in https://github.com/apify/crawlee-python/issues/1437. | ||
|
|
||
| Also addresses https://github.com/apify/crawlee-python/issues/1396 by providing a unified | ||
| delay mechanism for crawl-delay directives. | ||
| """ |
No need to mention the GitHub issues here if they will be resolved by this PR.
| """ | ||
| self._inner = inner | ||
| self._domain_states: dict[str, _DomainState] = {} | ||
| self._buffered_requests: list[Request] = [] |
This is a fundamental deviation from what I described - instead of storing the requests in memory, we should create a special request queue in memory (see #1762 (review)) for each domain. The ThrottlingRequestManager will then delegate fetch_next_request calls to sub-queues that are not currently delayed.
You're absolutely right. This has been fully refactored. ThrottlingRequestManager now dynamically instantiates RequestQueue.open(alias=f"throttled-{domain}") for throttled domains. Requests are routed to these separate sub-queues to avoid memory pressure, and fetch_next_request pulls from them seamlessly when domain delays expire. State operations (like is_empty and get_handled_count) now aggregate cleanly across the inner queue and all sub-queues. This aligns with Crawlee's distributed design and ensures request persistence.
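A sketch of what that aggregation might look like, written as free functions over the wrapped queues; the real logic lives as methods on `ThrottlingRequestManager`, and the names here are illustrative:

```python
# Illustrative sketch of the aggregation described above, not the PR's code.
import asyncio

from crawlee.storages import RequestQueue


async def aggregated_is_empty(inner: RequestQueue, sub_queues: dict[str, RequestQueue]) -> bool:
    """Empty only if the inner queue and every per-domain sub-queue are empty."""
    states = await asyncio.gather(inner.is_empty(), *(q.is_empty() for q in sub_queues.values()))
    return all(states)


async def aggregated_handled_count(inner: RequestQueue, sub_queues: dict[str, RequestQueue]) -> int:
    """Handled counts are additive across the inner queue and all sub-queues."""
    counts = await asyncio.gather(
        inner.get_handled_count(),
        *(q.get_handled_count() for q in sub_queues.values()),
    )
    return sum(counts)
```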
…queues and update its integration across crawlers.
Heads up @janbuchar @vdusek @Mantisus: I've pushed a significant refactor based on the latest feedback.

- Sub-queues over memory buffer: ThrottlingRequestManager now delegates to persistent per-domain sub-queues via RequestQueue.open(alias=f"throttled-{domain}") instead of keeping throttled requests in memory.
- Test structure: Completely rewrote test_throttling_request_manager.py to drop the Test... classes and conform to Crawlee's standard test structure.
- BasicCrawler fixes: Addressed all inline nits (used isinstance(), renamed url to request_url in _raise_for_session_blocked_status_code, updated docstrings/comments).

The tests track the routing origin and safely aggregate get_handled_count and is_empty metrics across the main queue and sub-queues. All 24 tests pass, and Ruff and Pytest issues have been resolved. Let me know if the updated delegation architecture feels right!
Fixes #1437
Problem
When target websites return HTTP 429 (Too Many Requests), the `AutoscaledPool` scales UP instead of down — creating a "death spiral." This happens because:

- `SessionError` → session retires → request retried
- `is_system_idle` returns `True`
- `_autoscale()` sees idle CPU → increases concurrency

The existing `_snapshot_client` only tracks Apify storage API rate limits, not target website 429s.

Solution
Following @Pijukatel's suggestion, I created a dedicated `RequestThrottler` component that handles 429 backoff per domain — the `AutoscaledPool` is completely untouched.

Key features:

- Per-domain tracking: rate limiting on `example.com` doesn't affect `other-site.com`
- Exponential backoff: `2s → 4s → 8s → ...` capped at `60s`

How it works
- `BasicCrawler.__run_task_function` checks `RequestThrottler.is_throttled(url)` before processing
- On a 429 in `_raise_for_session_blocked_status_code`, the domain is recorded
- On success (`RequestState.DONE`), the backoff counter resets

Files changed
- `src/crawlee/_request_throttler.py`
- `src/crawlee/crawlers/_basic/_basic_crawler.py`
- `src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py`
- `src/crawlee/crawlers/_playwright/_playwright_crawler.py`
- `tests/unit/test_request_throttler.py`

Tests
Future work
This is a focused first step toward the full `RequestAnalyzer` that @Pijukatel outlined (with robots.txt integration, URL group management, etc.).